
UPSTREAM PR #17116: rpc : fix alloc size logic#262

Open
DajanaV wants to merge 2 commits into main from upstream-PR17116-branch_ggml-org-gg/rpc-fix-alloc-size

Conversation

@DajanaV (Collaborator) commented Nov 18, 2025

Mirrored from ggml-org/llama.cpp#17116

fix #16657
ref ggml-org/llama.cpp#16276 (review)

This fixes the RPC inference when Metal backend is involved.

Testing:

```shell
# server
make -j && ./bin/rpc-server

# cli
make -j && ./bin/llama-cli -m ../models/gemma-3-4b-it/ggml-model-f16.gguf --rpc localhost:50052 -ngl 99 --no-mmap -no-cnv -p "Hello" --top-k 1 -n 32 -fa on
```

TODO:

  • Check performance impact
  • Cache the responses to avoid extra RPC calls?

loci-review bot commented Nov 18, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: RPC Allocation Size Logic Fix

Overview

PR #262 implements a fix for RPC inference when Metal backend is involved, addressing allocation size calculation logic in the RPC system. The changes are contained within the GGML RPC subsystem (ggml-rpc.h and ggml-rpc.cpp) and do not modify core inference functions.

Analysis Results

Performance Metrics: No performance data was available for the specified version comparison, indicating either incomplete analysis pipeline execution or that the changes are too localized to generate measurable performance differences in the core inference path.

Code Changes Scope: The modifications are limited to:

  • RPC protocol version bump (breaking change requiring client/server sync)
  • Enhanced allocation size request structure to include source tensors
  • Null pointer safety improvements in tensor serialization
  • Expanded allocation logic for specific operations (GGML_OP_FLASH_ATTN_EXT, GGML_OP_MUL_MAT_ID)

Core Function Impact: The changes do not affect primary inference functions (llama_decode, llama_encode, llama_tokenize) or other performance-critical components identified in the project structure. The modifications are isolated to RPC backend allocation logic.

Network and Memory Impact: The fix introduces additional RPC message overhead by serializing source tensors (GGML_MAX_SRC * sizeof(rpc_tensor) per allocation request) and increases server-side memory allocation. However, this overhead only affects distributed inference scenarios using RPC backends.

Correctness Benefits: The implementation addresses a fundamental issue where allocation size calculations were insufficient for certain tensor operations, particularly affecting Metal backend compatibility. The fix prevents potential allocation failures that could cause crashes or incorrect results in distributed inference setups.

Binary Impact: Changes affect RPC-enabled binaries (llama-cli, rpc-server) when used with distributed inference configurations. Standard local inference remains unaffected.

The changes represent a targeted correctness fix with minimal performance impact on typical usage patterns. The modifications improve system reliability for distributed inference scenarios while maintaining compatibility with existing local inference workflows.

@loci-dev loci-dev force-pushed the main branch 28 times, most recently from ab559ce to e612b7c Compare November 24, 2025 22:10
@loci-dev loci-dev force-pushed the main branch 9 times, most recently from 9239ee7 to 96dc574 Compare November 28, 2025 16:10
@loci-dev loci-dev force-pushed the upstream-PR17116-branch_ggml-org-gg/rpc-fix-alloc-size branch from 590a805 to 4953693 Compare November 28, 2025 17:35
loci-review bot commented Nov 28, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary

Project: llama.cpp (auroralabs-loci)
PR #262: RPC allocation size logic fix for Metal backend compatibility
Versions Compared: aa09cdea (base) vs 135d56f6 (target)


Analysis Result

No performance changes detected between the baseline and target versions. All 16 binaries show 0.0% change in power consumption. Function-level metrics indicate no measurable differences in response time or throughput across the codebase.

Code Changes: The PR modifies RPC protocol message structures in ggml-rpc.cpp and ggml-rpc.h to include source tensor serialization for allocation size queries. These changes affect RPC communication logic but do not alter compiled binary behavior for the analyzed versions, suggesting either that the modifications are not active in the current build configuration or that the analysis captured identical build artifacts.

Inference Impact: No impact on tokens per second. Core inference functions (llama_decode, llama_encode, llama_tokenize) show no response time or throughput changes.

@loci-dev loci-dev force-pushed the main branch 16 times, most recently from 9368c2d to 50d76f4 Compare December 1, 2025 09:13


Development

Successfully merging this pull request may close these issues.

3 participants